ABSTRACT 



Using an anchor-item design of test equating, the effects of 
three equating methods (Tucker linear and two three-parameter 
item-response-theory-based (3PL-IRT) methods) and the content 
representativeness of anchor items on the accuracy of equating were examined; 
and an innovative way of evaluating equating accuracy appropriate for the 
particular item-sampling design of the study was introduced. Data analyzed 
were test results from 2 forms of a professional competency examination with 
197 and 203 items respectively. There were 145 anchor items embedded in both 
forms, and the 2 examinee groups were not randomly formed. From the two test 
forms, four pairs of shortened test forms were created to differ in the 
content representativeness of their anchor items. The total raw score on the 
original anchor items was regarded as a "pseudo true score," which was used 
as a criterion for evaluating equating accuracy. Overall, the three equating 
methods appeared to yield moderately accurate equating results on every test, 
but the outcomes of the IRT-based methods seemed to be more accurate than the 
outcomes of the Tucker method. The accuracy of equating depended on the 
content representativeness of the anchor items, no matter which method was 
used to equate test forms. The 3PL-IRT model seemed appropriate for equating 
the test form with a negatively skewed score distribution. One appendix presents 
the item sampling schemes and the other contains tables of correlation 
analyses on anchor and nonanchor items. (Contains 6 tables, 2 figures, and 58 
references.) (SLD) 

Using an anchor-item design of test equating, the effects of three 
equating methods (Tucker linear and two 3PL-IRT-based methods) and the content 
representativeness of anchor items on the accuracy of equating were examined 
in this study. The main goals were to investigate (a) whether equating 
accuracy improved with more content-representative anchor items, (b) whether 
the effect of the content representativeness of anchor items depended on the 
particular equating method used, and (c) relatively, which equating method 
yielded the most accurate results. An innovative way of evaluating equating 
accuracy appropriate for the particular item-sampling design of this study was 
introduced. The adequacy of using the 3PL IRT model for equating alternate 
forms of a minimum competency test was also discussed. 

The data analyzed were test results from two forms of a professional 
competency examination that had 197 and 203 items respectively. There were 
145 anchor items embedded in both forms, and the two examinee groups were not 
randomly formed. After pooling the two test forms, four pairs of shorter test 
forms were created by sampling items from the item pool using four distinct 
item sampling schemes. These item sampling schemes resulted in tests that 
differed in the content representativeness of their anchor items, and the 
effect of anchor length was controlled. For each shorter test, the pair of 
alternate forms were equated using both the conventional linear method and the 
IRT-based methods. 

The total raw score on the 145 anchor items in the original test was 
regarded as a "pseudo true score," which was used as a criterion for 
evaluating equating accuracy. The estimated IRT true scores from the two 
IRT-based equating methods and the Tucker linear equating results were each 
correlated with the "pseudo true score" to study the accuracy of these 
equating methods. The Pearson product-moment correlation coefficient (r) was 
used to represent the estimated accuracy of equating results. 

In summary, this study found that (a) overall, the three equating 
methods appeared to yield moderately accurate equating results on every test; 
(b) however, the equating outcomes of the IRT-based methods seemed to be more 
accurate than the outcomes of the Tucker method, regardless of the content 
representativeness of anchor items; (c) the two IRT-based methods yielded very 
similar equating results; (d) the accuracy of equating depended on the content 
representativeness of anchor items, no matter which method was used to equate 
test forms; and (e) the 3PL IRT model seemed appropriate for equating the 
minimum competency test that had a negatively skewed score distribution. 

One important implication of these findings was that, regardless of equating 
method, equating results were more likely to be accurate when anchor items 
were more representative of the total test, or when the content coverage of a 
test concentrated on fewer topics. Suggestions for future research were 
provided in this paper. 

The Effects of Content Mix and Equating Method on the Accuracy of 
Test Equating Using Anchor-Item Design 

Introduction 

In testing practice, often not all examinees take a test on the same 
occasion or take the same test. To ensure test security, there is a need for 
alternate test forms. Test forms that have comparable scores are also needed 
for measuring growth or trends of learning. The need for interchangeable 
parallel test forms is especially urgent for licensure exams and any other 
tests used to inform critical decisions. In addition to careful test 
construction, a practical strategy to arrive at comparable test scores is to 
establish equivalency between different forms via equating. 

A variety of equating techniques have been developed, including linear 
and non-linear equating. Mainly, equating models vary substantially in their 
assumptions, mathematical functions, as well as procedures required. 
Conventional linear methods, such as Tucker linear equating, are 
straightforward and convenient but their results do not always meet all 
criteria for equivalent tests. To overcome the drawbacks of conventional 
equating, equating methods based on IRT estimated scores have been developed 
and are increasingly used. 

IRT equating is especially useful in common-item design, where random 
assignment of examinees is not feasible and the assumptions required by 
conventional equating are likely to be violated (Cook & Eignor, 1991; Crocker 
& Algina, 1986). Research results have shown that IRT methods are more robust 
than conventional equating and lead to greater stability when the tests to 
be equated differ somewhat in content and length (Petersen, Cook, & Stocking, 
1983). Despite its various appeals in theory and practice, IRT equating 
remains under scrutiny because of its sometimes inconsistent behavior. A 
possible interaction between IRT method and test also raises concerns (Hills, 
Subhiyah, & Hirsch, 1988; Petersen, Cook, & Stocking, 1983). In addition, the 
practical significance or value of the improved accuracy achieved by IRT 
equating over conventional methods needs to be considered. 

To enhance equating accuracy, this study seeks to settle controversies 
about various equating methods in practice. Pairs of test forms were assembled by 
various item sampling schemes to manipulate content mix of a test, or content 
representation of anchor items embedded in the test. The test forms were then 
equated by Tucker linear method and two IRT-based equating methods, using 
anchor-item design. Various equating results were evaluated against an 
innovative criterion of equating accuracy, which is appropriate for the 
particular design of this study. Comparisons of equating results are 
presented and discussed, and suggestions are made for future research and 
equating practice. 



Research Purposes 

In search of a better understanding of the function of anchor 
characteristics in equating and the relative effectiveness of equating 
methods, this study has the following specific purposes: 

1. To investigate the effect of content representativeness of anchor 
items on equating accuracy, while the method of equating varies. 

2. To estimate, evaluate, and compare the accuracy of linear equating 
and IRT-based equating. 

3. To compare the equating results of two IRT equating methods (two- 
stage method and fixed-b method) that are based on different procedures. 

4. To apply an innovative criterion for evaluating equating accuracy 
that is appropriate for the particular design of this study so the 
effectiveness of various methods can be evaluated. 

5. To inform testing practice, based on the findings of this study, 
about ways to improve equating when anchor-item design is used. 

By pursuing solutions to the issues listed above, this study is expected to 
make contributions to the improvement of test equating practice. 

Research Questions 

The research questions of this study are shaped by personal interest in 
understanding and evaluating the effectiveness of various equating methods. 
They also reflect important equating issues in practice, and they are made 
viable by the rich context of the data analyzed in this study. To achieve the 
study goals described previously, the following specific research questions 
are raised: 

1. Do equating results depend on the content representation of anchor 
items? Specifically, when the content mix of anchor items becomes more 
representative of the entire test, does the accuracy of equating improve? 

2. To what extent do the results of Tucker linear equating and the IRT 
methods agree, or vary? 

3. To what extent do the results of IRT two-stage and IRT fixed-b 
equating procedures agree, or vary? 

4. How accurate are the equating results yielded by various equating 
methods, compared against an appropriate criterion for evaluating equating 
accuracy? 

5. Is the three-parameter logistic (3PL) IRT model appropriate for the 
minimum competency test analyzed in this study, which has a negatively skewed 
score distribution? 



Literature Review 

Important equating issues, such as conditions of equating, procedures 
and assumptions of common equating methods, and findings from previous 
research about the merits of various equating methods, are reviewed in this 
section. 

Conditions of Equivalency 

If test Y is to be equated to test X, no matter what equating procedure 
is chosen, the following conditions must be satisfied to conclude that the 
scores on test X and test Y are equivalent (Angoff, 1984; Dorans, 1990; Lord, 
1980; Petersen, Kolen, & Hoover, 1989): 

1. Both tests measure the same construct. 

2. The equating achieves equity. That is, for individuals of 
identical proficiency, the conditional frequency distributions of scores on 
the two tests are the same. 

3. The equating transformation is symmetric. That is, the equating of Y 
to X is the inverse of the equating of X to Y. 

4. The equating transformation is invariant across sub-groups of the 
population, from which it is derived. 
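
The symmetry condition can be illustrated with a small numerical sketch. The code below is not part of the original study; the function names and the summary statistics are hypothetical and chosen only for illustration. It contrasts a linear equating function, whose inverse returns the original score, with a least-squares regression prediction, which does not.

```python
def linear_equate(x, mu_x, sd_x, mu_y, sd_y):
    """Map a Form X score onto the Form Y scale by matching means and SDs."""
    return (sd_y / sd_x) * (x - mu_x) + mu_y


def regression_predict(x, mu_x, sd_x, mu_y, sd_y, r):
    """Predict Y from X by least squares; the slope is shrunk by the correlation r."""
    return r * (sd_y / sd_x) * (x - mu_x) + mu_y


# Hypothetical summary statistics, for illustration only.
MU_X, SD_X, MU_Y, SD_Y, R = 50.0, 10.0, 55.0, 8.0, 0.8

x = 65.0

# Equating X to Y and then Y back to X recovers the starting score.
y_equated = linear_equate(x, MU_X, SD_X, MU_Y, SD_Y)
x_back = linear_equate(y_equated, MU_Y, SD_Y, MU_X, SD_X)

# Regressing Y on X and then X on Y does not recover it unless r = 1.
y_predicted = regression_predict(x, MU_X, SD_X, MU_Y, SD_Y, R)
x_regressed_back = regression_predict(y_predicted, MU_Y, SD_Y, MU_X, SD_X, R)
```

With these assumed values, the equated score returns to 65, whereas the regression round trip ends at 59.6. This is why a regression function does not satisfy the symmetry condition.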

Equating Guidelines 

There are no absolutely superior criteria to guide the selection of an 
equating design or method. Arbitrary judgments and decisions that draw on 
equating expertise and experience are always needed. Factors such as 
feasibility, cost, and any unique testing context should all be considered. 

Brennan and Kolen (1987) argued that the test content and statistical 
specifications for tests being equated ought to be defined precisely and be 
stable over time. In the process of test construction, item statistics should 
be obtained from pre-testing or a previous use of the test. Each test should 
be reasonably long, with at least 35 items, and the scoring keys should be 
consistent. The stems for common items, alternatives, and stimulus materials 
should be identical for the forms to be equated. The characteristics of 
examinee groups should be stable over time, too. The sizes of the groups 
should be relatively large, larger than roughly 400. The curriculum, training 
materials, and field of study should also be stable. The test items should be 
administered and secured under standardized conditions. 

Criteria for Selecting Equating Methods 

Usually an equating method is selected or tailored to accommodate the needs 
of a particular testing situation. The three major aspects to be considered 
in selecting an equating method are reflected in these questions: (1) Are 
the underlying assumptions required tenable? (2) Is the procedure practical? 
and (3) How good is the equating result? (Crocker & Algina, 1986) 

Tenability of Model Assumptions 

The premise of model application is that all the underlying assumptions 
of the selected model hold. Linear equating assumes that the score 
distributions of the tests being equated have identical shapes, and is 
appropriate when score distributions differ only in their means and/or 
standard deviations. Equipercentile equating requires fewer 
assumptions than linear equating. However, in theory, it is associated with 
larger errors than linear equating (Lord, 1982a). Both linear and 
equipercentile equating assume that the tests being equated measure the same 
trait and have equal reliability. 

Given tests that have different average difficulty, linear and 
equipercentile equating are likely to yield erroneous results. The results of 
these methods also fail to meet the conditions of equity and population 
invariance (Hambleton & Swaminathan, 1990). IRT equating does not have 
these drawbacks and could be a better alternative. 

Applicability of Design and Method 

Random groups design, single group design with counter-balancing, and 
common-item nonequivalent groups design are three common designs used to 
collect data before equating (Kolen & Brennan, 1995). Random examinee groups 
design is desirable because each examinee only has to take one form and 
several forms can be equated at the same time. Nevertheless, it requires the 
test forms to be available and administered at the same time, which is 
sometimes not practical. One solution to this problem is the use of anchor 
design. Either test forms with embedded anchor items (the internal anchor) 
can be given to different examinee groups, or a third test (the external 
anchor) can be given to both examinee groups that take different test forms. 

Without random assignment, the score distributions of anchor items for 
different sub-populations may be markedly different and the assumption of 
equity is unlikely to hold (Crocker & Algina, 1986). In such cases, the linear 
or equipercentile method is likely to yield inaccurate results, whereas 
IRT-based methods seem to yield more accurate results. 

Equating Accuracy 

A major concern for test equating is to what extent the equated scores 
are equivalent. Random equating errors result from the sampling of examinees 
and can be controlled by using large examinee samples and choosing appropriate 
equating designs. Systematic equating errors, in contrast, are caused by 
violations of assumptions and conditions of equating methods. Sometimes, 
systematic errors can be so large that the results of equating may be worse 
than no equating (Kolen & Brennan, 1995). To reduce systematic errors, the 
conditions of equating and assumptions made in equating should be carefully 
examined. 

Perfect equivalency can never be achieved because true score can only be 
estimated. Consequently, there is no absolute criterion for evaluating 
equating accuracy. In practice, equating results are often compared against 
some arbitrary sound criteria to study equating accuracy. Therefore, equating 
accuracy is an estimate depending on the nature of the arbitrary criterion 
used. It may be unreasonable to compare all kinds of equating results against 
one single criterion, because equating methods vary in their assumptions and 
estimation procedures. 

Typically, conventional equating methods that have been known to be 
satisfactory in yielding accurate results, or have been used in practice for 
quite a time, are used as evaluation criteria for equating accuracy. Skaggs 
and Lissitz (1986) argued that the best situation for research purposes was to 
equate a test to itself through intervening forms. 

Tucker Linear Equating 

Linear equating has the appeal of simplicity in terms of score 
transformation and is used most often with the anchor-item design (Kolen & 
Brennan, 1987). Among the many linear methods, Tucker linear equating is one 
of the methods employed most frequently. 

Synthetic Population 

For anchor-item design, Tucker's method involves the use of a synthetic 
population (Braun & Holland, 1982). A synthetic population is usually defined 
as a combination of the proportionally weighted (proportional to sample sizes) 
populations of examinees taking different test forms. Typically, an equating 
function is viewed as being defined for a single population; therefore, the 
two examinee populations must be combined as one single population for 
defining an equating relationship (Kolen & Brennan, 1987). 

Model Assumptions 

In an anchor-item equating design, suppose Population 1 takes Form X, 
Population 2 takes Form Y, and V is the embedded set of anchor items in both 
forms; to equate scores on Form X to the scale of Form Y, Tucker linear 
equating requires some strong statistical assumptions as follows (Kolen & 
Brennan, 1987; Kolen & Brennan, 1995): 

1. The linear regression function (slope and intercept) for the 
regression of X on V is the same for Populations 1 and 2. The function for 
the regression of Y on V is also the same for the two populations. 

2. The variance of X given V is the same for the two populations, and 
the variance of Y given V is also the same for the two populations. 

Under the above assumptions, the linearly transformed scores on one 
form, yielded by Tucker's method, will have the same mean and standard 
deviation as the scores on another form. Because of the assumptions about the 
variances and regression functions in relation to the two populations, Tucker 
linear equating is more accurate when examinee groups are similar. 

Equating Procedures 

Using the proportional weights to form a synthetic population, Tucker 
linear equating basically involves the following concepts and procedures 
(Kolen & Brennan, 1987; Kolen & Brennan, 1995): 

1. Find the weights for Populations 1 and 2 by using these formulas: 
$w_1 = n_1/(n_1+n_2)$ and $w_2 = n_2/(n_1+n_2)$, where $n_1$ and $n_2$ are 
the sample sizes of examinees from Populations 1 and 2 respectively. 

2. Let $\alpha_1$ and $\alpha_2$ be the regression slopes for the 
populations, then for Population 1, 

$$\alpha_1(X|V) = \sigma_1(X,V)/\sigma_1^2(V) \quad\text{and}\quad \alpha_1(Y|V) = \sigma_1(Y,V)/\sigma_1^2(V),$$

and for Population 2, 

$$\alpha_2(X|V) = \sigma_2(X,V)/\sigma_2^2(V) \quad\text{and}\quad \alpha_2(Y|V) = \sigma_2(Y,V)/\sigma_2^2(V).$$

In addition, let $\beta_1$ and $\beta_2$ be the regression intercepts for the 
two populations, and $\mu_1$ and $\mu_2$ be the population means, then 

$$\beta_1(X|V) = \mu_1(X) - \alpha_1(X|V)\,\mu_1(V) \quad\text{and}\quad \beta_1(Y|V) = \mu_1(Y) - \alpha_1(Y|V)\,\mu_1(V),$$

and 

$$\beta_2(X|V) = \mu_2(X) - \alpha_2(X|V)\,\mu_2(V) \quad\text{and}\quad \beta_2(Y|V) = \mu_2(Y) - \alpha_2(Y|V)\,\mu_2(V).$$

To compute $\alpha_1(X|V)$ and $\alpha_2(Y|V)$, observed data can be plugged 
in to the above equations. 

3. By assumptions about the slopes and intercepts for the two 
populations, $\alpha_1(X|V) = \alpha_2(X|V)$, $\alpha_1(Y|V) = \alpha_2(Y|V)$, 
$\beta_1(X|V) = \beta_2(X|V)$, and $\beta_1(Y|V) = \beta_2(Y|V)$. And, by 
assumptions about the same variances for the two populations, 

$$\sigma_1^2(X)\,[1-\rho_1^2(X,V)] = \sigma_2^2(X)\,[1-\rho_2^2(X,V)],$$

$$\sigma_1^2(Y)\,[1-\rho_1^2(Y,V)] = \sigma_2^2(Y)\,[1-\rho_2^2(Y,V)].$$

4. With the above assumptions, it can be demonstrated that 

$$\mu_1(Y) = \mu_2(Y) + \alpha_2(Y|V)\,[\mu_1(V)-\mu_2(V)], \qquad \mu_2(X) = \mu_1(X) - \alpha_1(X|V)\,[\mu_1(V)-\mu_2(V)],$$

$$\sigma_1^2(Y) = \sigma_2^2(Y) + \alpha_2^2(Y|V)\,[\sigma_1^2(V)-\sigma_2^2(V)], \qquad \sigma_2^2(X) = \sigma_1^2(X) - \alpha_1^2(X|V)\,[\sigma_1^2(V)-\sigma_2^2(V)],$$

and 

$$\sigma_1(Y,V) = \sigma_2(Y,V)\,[\sigma_1^2(V)/\sigma_2^2(V)], \qquad \sigma_2(X,V) = \sigma_1(X,V)\,[\sigma_2^2(V)/\sigma_1^2(V)].$$

5. The parameters for the synthetic population can be expressed by the 
weights and the parameters of Populations 1 and 2. The equations for the 
population means are (a) $\mu_s(X) = w_1\mu_1(X) + w_2\mu_2(X)$, 
(b) $\mu_s(Y) = w_1\mu_1(Y) + w_2\mu_2(Y)$, and 
(c) $\mu_s(V) = w_1\mu_1(V) + w_2\mu_2(V)$. And, the population variances are 

$$\sigma_s^2(X) = w_1\sigma_1^2(X) + w_2\sigma_2^2(X) + w_1w_2\,[\mu_1(X)-\mu_2(X)]^2,$$

$$\sigma_s^2(Y) = w_1\sigma_1^2(Y) + w_2\sigma_2^2(Y) + w_1w_2\,[\mu_1(Y)-\mu_2(Y)]^2,$$

and 

$$\sigma_s^2(V) = w_1\sigma_1^2(V) + w_2\sigma_2^2(V) + w_1w_2\,[\mu_1(V)-\mu_2(V)]^2,$$

where s denotes the synthetic population. 

6. Substitute the equations in step 4 in the equations in step 5, the 
means and variances for the synthetic population on Form X and Form Y can be 
derived as follows: 

$$\mu_s(X) = \mu_1(X) - w_2\,\alpha_1(X|V)\,[\mu_1(V)-\mu_2(V)],$$

$$\mu_s(Y) = \mu_2(Y) + w_1\,\alpha_2(Y|V)\,[\mu_1(V)-\mu_2(V)],$$

$$\sigma_s^2(X) = \sigma_1^2(X) - w_2\,\alpha_1^2(X|V)\,[\sigma_1^2(V)-\sigma_2^2(V)] + w_1w_2\,\alpha_1^2(X|V)\,[\mu_1(V)-\mu_2(V)]^2,$$

and 

$$\sigma_s^2(Y) = \sigma_2^2(Y) + w_1\,\alpha_2^2(Y|V)\,[\sigma_1^2(V)-\sigma_2^2(V)] + w_1w_2\,\alpha_2^2(Y|V)\,[\mu_1(V)-\mu_2(V)]^2.$$

To obtain estimates for the means and variances for the synthetic population, 
plug in observed data to the above equations. 

7. After taking the square roots of $\sigma_s^2(X)$ and $\sigma_s^2(Y)$, the 
equation for Tucker-linear transformation, 
$l_Y(x) = [\sigma_s(Y)/\sigma_s(X)]\,[x-\mu_s(X)] + \mu_s(Y)$, is obtained by 
replacing the parameters in the above equation with the estimated values 
obtained previously. 
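
The seven steps above can be sketched in a few lines of code. The following is a minimal illustration rather than the program used in this study. The function names and the toy data are hypothetical, and population (not sample) variances and covariances are used as a simplifying assumption.

```python
from statistics import mean, pvariance


def covariance(a, b):
    """Population covariance of two equal-length score lists."""
    ma, mb = mean(a), mean(b)
    return sum((p - ma) * (q - mb) for p, q in zip(a, b)) / len(a)


def tucker_linear(x1, v1, y2, v2):
    """Return (slope, intercept) of the Tucker function mapping X onto Y's scale.

    x1, v1: total and anchor scores of the group that took Form X (Population 1).
    y2, v2: total and anchor scores of the group that took Form Y (Population 2).
    """
    n1, n2 = len(x1), len(y2)
    w1, w2 = n1 / (n1 + n2), n2 / (n1 + n2)          # step 1: weights

    a1 = covariance(x1, v1) / pvariance(v1)          # step 2: alpha_1(X|V)
    a2 = covariance(y2, v2) / pvariance(v2)          # step 2: alpha_2(Y|V)

    d_mean = mean(v1) - mean(v2)                     # mu_1(V) - mu_2(V)
    d_var = pvariance(v1) - pvariance(v2)            # sigma_1^2(V) - sigma_2^2(V)

    # Step 6: synthetic-population means and variances.
    mu_sx = mean(x1) - w2 * a1 * d_mean
    mu_sy = mean(y2) + w1 * a2 * d_mean
    var_sx = pvariance(x1) - w2 * a1**2 * d_var + w1 * w2 * a1**2 * d_mean**2
    var_sy = pvariance(y2) + w1 * a2**2 * d_var + w1 * w2 * a2**2 * d_mean**2

    # Step 7: linear transformation l_Y(x) = slope * x + intercept.
    slope = (var_sy / var_sx) ** 0.5
    intercept = mu_sy - slope * mu_sx
    return slope, intercept


# Hypothetical toy data (five examinees per group), for illustration only.
x1, v1 = [10, 12, 15, 18, 20], [4, 5, 6, 8, 9]
y2, v2 = [12, 14, 17, 20, 22], [4, 5, 6, 8, 9]
slope, intercept = tucker_linear(x1, v1, y2, v2)
```

In this toy case the two groups have identical anchor-score distributions and Form Y scores are uniformly two points higher, so the function reduces to adding two points to each Form X score (slope 1, intercept 2). With real data the anchor means and variances differ between groups, and the adjustment terms in step 6 become nonzero.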

Some Practical Concerns 

Despite the fact that equal reliability is needed for Tucker linear 
equating, Kolen and Brennan (1987) argued that, if the test forms were 
designed to be as similar as possible in content and statistical 
characteristics, and had the same length, small differences in reliability 
were not likely to have negative influences on the equating of the two forms. 

Compared to the Levine equally reliable method, another frequently used 
linear method that requires the assumption of perfectly correlated (r=1.0) 
true scores on the two forms, the Tucker linear method is often considered 
more appropriate when examinee groups are more similar and test forms less 
similar. The Levine method, in contrast, is often said to be more appropriate 
when test forms are more similar and examinee groups less similar. 
Nevertheless, research findings have not yet provided clear evidence for the 
argument (Kolen & Brennan, 1987). 

IRT Equating Methods 

Classical methods of equating, developed for equating observed raw 
scores, are criticized for not being able to meet the conditions of equating 
(equity, symmetry, and invariance). Equating based on item response theory, 
in contrast, does not suffer from the same drawbacks, provided the IRT model 
fits the data (Hambleton and Swaminathan, 1990; Kolen, 1981). The result of 
IRT equating, however, varies with the particular equating technique or 
procedure used. This section provides an overview of IRT equating using 
anchor-item design. 

Linear Transformation of IRT Scales (Two-stage Method) 

IRT parameter estimates, obtained from alternate forms of a test, can be 
converted to the same scale via linear transformation (Kolen and Brennan, 
1995). Assuming item and person invariance, linear transformation is 
reasonable for the non-equivalent-group anchor-item design because the 
difficulty and discrimination parameters for the common items from the 
alternate forms are linearly related (Petersen, Cook, & Stocking, 1983; Hills, 
Subhiyah, & Hirsch, 1988). 

In theory, given that the 3PL IRT model fits the data, transformation 
equations relating IRT parameters for alternate forms of a test (say, Form X 
and Form Y) are defined as follows (Hambleton and Swaminathan, 1990; Kolen & 
Brennan, 1995): 

(1) For person i, the equation for the ability parameter is 
$\theta_{y_i} = A\theta_{x_i} + B$, where A and B are constants and 
$\theta_{y_i}$ and $\theta_{x_i}$ are the values of person i's ability on the 
scales of Forms Y and X. 

(2) Let $a_{y_j}$, $b_{y_j}$, and $c_{y_j}$ be the item parameters for item j 
on Form Y scale, and $a_{x_j}$, $b_{x_j}$, and $c_{x_j}$ be the parameters on 
Form X scale, (a) the equation for item discrimination parameter is 
$a_{y_j} = a_{x_j}/A$, (b) the equation for item difficulty parameter is 
$b_{y_j} = Ab_{x_j} + B$, and (c) the equation for lower asymptote (guessing) 
parameter is $c_{y_j} = c_{x_j}$. 

For a group of persons or items, Kolen & Brennan (1995) showed that the 
transformation constants (A and B) can be expressed as follows: 

$$A = \sigma(b_y)/\sigma(b_x) = \mu(a_x)/\mu(a_y) = \sigma(\theta_y)/\sigma(\theta_x),$$

and 

$$B = \mu(b_y) - A\mu(b_x) = \mu(\theta_y) - A\mu(\theta_x).$$

In the above equations, the means $\mu(a_x)$, $\mu(a_y)$, $\mu(b_x)$, and 
$\mu(b_y)$, as well as the standard deviations $\sigma(b_x)$ and 
$\sigma(b_y)$, are defined over items. And, the means $\mu(\theta_x)$ and 
$\mu(\theta_y)$, as well as the standard deviations $\sigma(\theta_x)$ and 
$\sigma(\theta_y)$, are defined over persons. 

In practice, IRT parameters are unknown and thus need to be estimated. 
In anchor-item equating design, parameter estimates for anchor items can be 
obtained and used to replace the parameters in the above equations to find the 
scaling constants. Basically, linear transformation of IRT scales involves 
two stages: (a) first, alternate test forms are calibrated separately; (b) the 
information on anchor items obtained from the two IRT calibrations is then 
used to derive transformation equations for person and item parameters, which 
can be used to arrive at equivalent scaled scores for examinees taking 
different test forms. 
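
The second stage can be sketched as follows. This is an illustrative example only: it assumes the anchor-item difficulty estimates from the two separate calibrations are already available, it uses the expression for A based on the standard deviations of the b estimates, and the function names and parameter values are hypothetical.

```python
from statistics import mean, pstdev


def scaling_constants(b_x, b_y):
    """Constants A and B from anchor-item difficulty estimates on each scale."""
    A = pstdev(b_y) / pstdev(b_x)        # A = sigma(b_y) / sigma(b_x)
    B = mean(b_y) - A * mean(b_x)        # B = mu(b_y) - A * mu(b_x)
    return A, B


def item_to_y_scale(a_x, b_x, c_x, A, B):
    """Place one item's 3PL parameters, estimated on the X scale, on the Y scale."""
    return a_x / A, A * b_x + B, c_x


def theta_to_y_scale(theta_x, A, B):
    """Place an ability estimate from the X scale on the Y scale."""
    return A * theta_x + B


# Hypothetical anchor-item difficulty estimates from the two calibrations.
anchor_b_x = [-1.0, 0.0, 1.0, 2.0]
anchor_b_y = [1.2 * b + 0.5 for b in anchor_b_x]   # built so that A = 1.2, B = 0.5

A, B = scaling_constants(anchor_b_x, anchor_b_y)
a_y, b_y, c_y = item_to_y_scale(1.2, 0.0, 0.2, A, B)
theta_y = theta_to_y_scale(1.0, A, B)
```

Because the toy Form Y estimates were generated from an exact linear relationship, the constants are recovered up to floating-point rounding. Real estimates contain calibration error, so the alternative expressions for A would generally give somewhat different values.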

Other than the above scale-transformation procedure, various techniques 
for transforming IRT scales have been proposed. Regression techniques can be 
applied, but the established relationship is not symmetric (Hambleton and 
Swaminathan, 1990). The mean/sigma method (Marco, 1977), the mean/mean method 
(Loyd & Hoover, 1980), and the method involving the use of the geometric means 
of the a-parameters (Mislevy & Bock, 1990) are all straightforward and similar 
to the procedure described above. Taking into account individual standard 
error of estimate, the robust mean and sigma method (Linn, Levine, Hastings 
and Wardrop, 1981) and robust iterative weighted mean and sigma method 
(Stocking & Lord, 1983) use variance-weighted means and standard deviations to 
find the transformation constants. In short, poorly estimated parameters with 
larger variances receive less weight. The iterative method also gives less 
weight to outliers. 

The above methods, however, suffer from a common flaw; that is, they do 
not take into account all of the item parameters at the same time. This 
matters because various combinations of a-, b-, and c-parameter estimates may 
result in very similar item characteristic curves over the range of the most 
frequently occurring ability levels. 

Characteristic Curve Transformation (Formula) Methods 

Unlike the above methods, characteristic curve methods developed by 
Haebara (1980) and Stocking and Lord (1983) consider the parameter estimates 
simultaneously. The two methods estimate the difference between the item 
characteristic curves on the two scales, for a given $\theta$ and over items, 
differently. However, both methods rely on iterative algorithms that minimize 
the overall differences over examinees to find the transformation constants (A 
and B). 

Some comparison studies have found that the characteristic curve 
transformation methods yielded more accurate results than the other methods. 
Nevertheless, the results sometimes did not differ much (Baker & Al-Karni, 
1991). In addition to the computationally intensive iteration procedures, the 
characteristic curve methods also have the limitation of not explicitly 
accounting for the error in estimating item parameters (Kolen & Brennan, 
1995). 
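
The idea of a characteristic curve criterion can be illustrated with a small sketch in the spirit of the Stocking and Lord approach, in which the squared difference between the anchor test characteristic curves on the two scales is summed over a set of ability values. The sketch below is not the algorithm of either cited method: a coarse grid search is substituted for the iterative minimization, the scaling constant D = 1.7 is assumed, and all names and parameter values are hypothetical.

```python
import math


def p3pl(theta, a, b, c, D=1.7):
    """3PL probability of a correct response (D = 1.7 is an assumed constant)."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def tcc_loss(A, B, anchors_x, anchors_y, thetas):
    """Sum over thetas of the squared difference between anchor test
    characteristic curves, after moving the X estimates to the Y scale."""
    loss = 0.0
    for t in thetas:
        tcc_y = sum(p3pl(t, a, b, c) for a, b, c in anchors_y)
        tcc_x = sum(p3pl(t, a / A, A * b + B, c) for a, b, c in anchors_x)
        loss += (tcc_y - tcc_x) ** 2
    return loss


def grid_search(anchors_x, anchors_y, thetas):
    """Crude stand-in for an iterative minimizer: try A and B on a 0.05 grid."""
    best = (float("inf"), None, None)
    for i in range(10, 31):              # A = 0.50, 0.55, ..., 1.50
        for j in range(-20, 21):         # B = -1.00, -0.95, ..., 1.00
            A, B = i / 20, j / 20
            loss = tcc_loss(A, B, anchors_x, anchors_y, thetas)
            if loss < best[0]:
                best = (loss, A, B)
    return best[1], best[2]


# Hypothetical anchor-item estimates (a, b, c) on the Form X scale.
anchors_x = [(1.0, -0.5, 0.20), (0.8, 0.0, 0.25), (1.2, 0.7, 0.20)]
# Form Y estimates generated from a known transformation so the result can be checked.
TRUE_A, TRUE_B = 1.2, 0.5
anchors_y = [(a / TRUE_A, TRUE_A * b + TRUE_B, c) for a, b, c in anchors_x]
thetas = [t / 2 for t in range(-6, 7)]   # ability values from -3 to 3

A_hat, B_hat = grid_search(anchors_x, anchors_y, thetas)
```

Because the loss uses all three item parameters through the response probabilities, two sets of estimates that produce nearly identical curves are treated as nearly identical, which is the property the moment-based methods lack.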

Fixed-b Method 

The fixed-b IRT equating method sequentially calibrates test items 
following these steps: 

(1) Estimate bs and other item parameters for Book-A items; 

(2) Calibrate Book-B items by fixing bs for the anchor items at the values 
obtained from the previous step; 

(3) Book-B scale is then fixed onto the scale of Book A (Petersen, Cook, & 
Stocking, 1983; Hills, Subhiyah, & Hirsch, 1988). 

IRT True-Score Equating 

In theory, true scores on alternate tests or test forms can be obtained 
and equated. To eliminate negative scores and to provide a readily 
interpretable scale, values on the $\theta$ (ability) scale may be transformed 
to their corresponding true score values (Hambleton, Swaminathan, and Rogers, 
1991). Then, the true scores on alternate forms can be equated via some linear 
transformation. 

IRT true scores. 

Let $\theta$ be the parameter of ability and n be the number of items in a 
test, true score can be defined as follows: True score 
$(\xi) = \sum_{i=1}^{n} P_i(\theta)$ (Crocker and Algina, 1986; Lord, 1980; 
Hambleton & Swaminathan, 1990). When comparing tests or test forms of 
different lengths, instead of $\xi$, true proportion correct or domain score 
($\pi$) can be reported. Ranging between 0 and 1, $\pi$ is computed by 
dividing $\xi$ by the number of items (n) in test forms: $\pi = \xi/n$ 
(Hambleton & Swaminathan, 1990; Hambleton, Swaminathan, and Rogers, 1991). 

Taking into account the numbers of alternative options, which has 
substantial influence on guessing, the true score formula can be rewritten as 
(Petersen, Cook, & Stocking, 1983): 

$$\text{True score } (\xi) = \sum_{i=1}^{n} \left\{ [(k_i+1)/k_i]\,P_i(\theta) - 1/k_i \right\},$$

where n is the number of test items, and $k_i + 1$ is the number of 
alternative answers of item i. 
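
The three quantities defined above can be computed directly from item parameter estimates. The sketch below is illustrative only: it assumes the 3PL response function with a scaling constant D = 1.7, and the function names and item parameters are hypothetical.

```python
import math


def p3pl(theta, a, b, c, D=1.7):
    """3PL probability of a correct response (D = 1.7 is an assumed constant)."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def true_score(theta, items):
    """xi: sum of the response probabilities over the items."""
    return sum(p3pl(theta, a, b, c) for a, b, c in items)


def domain_score(theta, items):
    """pi: true proportion correct, xi divided by the number of items."""
    return true_score(theta, items) / len(items)


def formula_true_score(theta, items, k):
    """Guessing-adjusted true score; k[i] + 1 is the number of options of item i."""
    return sum(
        ((k_i + 1) / k_i) * p3pl(theta, a, b, c) - 1 / k_i
        for (a, b, c), k_i in zip(items, k)
    )


# Hypothetical item parameters (a, b, c) for a three-item test with 4 options each.
items = [(1.0, -0.5, 0.25), (0.8, 0.0, 0.25), (1.2, 0.7, 0.25)]
k = [3, 3, 3]

xi = true_score(0.0, items)
pi = domain_score(0.0, items)
xi_formula = formula_true_score(0.0, items, k)
```

One property worth noting is that an examinee of very low ability, whose response probability on each item approaches 1/(k_i + 1), has an adjusted true score near zero, whereas the unadjusted true score approaches the sum of the lower asymptotes.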

Equating true scores. 

Suppose the ability level of an examinee on test form X is $\theta_x$ and 
$\xi_x$ is the corresponding true score, and the ability level of the same 
examinee on alternate form Y is $\theta_y$ and $\xi_y$ is the corresponding 
true score; then the equating equations for true scores are 

$$\xi_x = \sum_{i=1}^{n} P_i(\theta_x) \quad\text{and}\quad \xi_y = \sum_{j=1}^{m} P_j(\theta_y) = \sum_{j=1}^{m} P_j(\alpha\theta_x + \beta),$$

where (1) n is the number of items on test X and m is the number of items on 
Y, (2) $P_i(\theta_x)$ is the probability of a correct answer to item i by an 
examinee whose ability level on test X is $\theta_x$, (3) $P_j(\theta_y)$ is 
the probability of a correct answer to item j by an examinee whose ability 
level on test Y is $\theta_y$, and (4) $\theta_y = \alpha\theta_x + \beta$ 
expresses the linear relationship between $\theta_y$ and $\theta_x$ 
(Hambleton & Swaminathan, 1990). In theory, for a given value $\theta_x$, the 
pair of true scores $(\xi_x, \xi_y)$ on tests X and Y can be determined. In 
practice, however, true scores can only be estimated. 
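
The equations above can be turned into a small routine that starts from a true score on Form X, finds the ability value that produces it, and evaluates the Form Y true score at the corresponding ability. The sketch below is illustrative and makes several assumptions: a 3PL response function with D = 1.7, bisection as the root-finding method, a search interval of -8 to 8, and hypothetical item parameters and function names.

```python
import math


def p3pl(theta, a, b, c, D=1.7):
    """3PL probability of a correct response (D = 1.7 is an assumed constant)."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def true_score(theta, items):
    """Sum of the response probabilities over the items of one form."""
    return sum(p3pl(theta, a, b, c) for a, b, c in items)


def theta_for_true_score(xi, items, lo=-8.0, hi=8.0):
    """Invert the increasing true-score function by bisection."""
    if not true_score(lo, items) < xi < true_score(hi, items):
        raise ValueError("true score is outside the attainable range")
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if true_score(mid, items) < xi:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def equate_true_score(xi_x, items_x, items_y, alpha=1.0, beta=0.0):
    """Form Y true score paired with a Form X true score xi_x."""
    theta_x = theta_for_true_score(xi_x, items_x)
    return true_score(alpha * theta_x + beta, items_y)


# Hypothetical forms already on a common scale (alpha = 1, beta = 0);
# Form Y items are 0.5 units more difficult than their Form X counterparts.
items_x = [(1.0, -0.5, 0.20), (0.8, 0.0, 0.25), (1.2, 0.7, 0.20)]
items_y = [(a, b + 0.5, c) for a, b, c in items_x]

xi_y = equate_true_score(2.0, items_x, items_y)
```

Because the hypothetical Form Y is harder, the Form Y true score paired with a Form X true score of 2.0 is lower than 2.0. True scores at or below the sum of the lower asymptotes have no corresponding ability value, which the routine reports as an error rather than extrapolating.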

Advantages of IRT Equating 

Traditional equating methods can yield good results if the test forms 
are sufficiently parallel (Lord, 1980). However, when the tests to be equated 
differ in difficulty, IRT methods are considered to be better than linear 
methods. Major advantages of IRT equating include: (a) its flexibility in 
modeling either a linear or a curvilinear relationship between raw scores on 
alternate test forms, (b) equal reliability or identical observed score 
distributions are not assumed (Cook & Eignor, 1983; Kolen, 1981), (c) 
"item-free" estimates for persons and "person-free" item characteristics 
(Lord, 1977) are attainable, (d) unlike traditional equating methods, which 
only yield one single standard error of measurement for all examinees, error 
of measurement for ability estimation at each ability level can be estimated 
by the IRT model, and (e) it may yield equivalent ability estimates for item 
sets differing in difficulty and/or discrimination, though not without 
measurement error (Green, Yen, & Burket, 1989). 

Other appeals of IRT in practice are: 

(1) It provides better equating at the upper end of the score scale, 
where important decisions are often made. 

(2) It improves the flexibility in choosing among editions of a test, 
given the editions are placed on the same scale. 

(3) If re-equating is needed, which usually occurs when certain items 
are added or dropped, it is easier to obtain the true score estimates with the 
IRT methods. 

(4) It enables pre-equating, which derives the relationship between the 
test editions before they are administered operationally, given the pretest 
data are available (Cook & Eignor, 1983). 

(5) For test forms across years that differ somewhat in content and 
length, IRT equating may reduce the bias or scale drift in the equating chains 
of a circular-equating paradigm, and the stability of the scales near the 
extreme values may increase (Petersen, Cook, & Stocking, 1983; Hills, 
Subhiyah, & Hirsch, 1988). 

Despite all the advantages listed above, Kolen and Brennan (1995) 
pointed out that IRT models gained their flexibility by making strong 
statistical assumptions and that these assumptions were not likely to hold 
precisely in real testing situations. As a result, the robustness of IRT models 
to violations of model assumptions needs to be studied. Green, Yen, and 
Burket (1989) noted that it was not safe to say that an IRT method would yield 
equivalent ability estimates if the items in different forms differed in 
content coverage. Therefore, test content should be carefully considered in 
IRT equating. Sometimes, the results of IRT equating agree with linear 
equating to a surprising degree. One possible explanation is that the test 
forms being equated were constructed to be considerably similar (Berk, 1982). 

Effects of Examinee-Group Differences 
Ideally, equating results should be independent of sub-populations of 
examinees of the same ability. Lawrence and Dorans (1990) suggested that 
population independence be investigated under circumstances in which examinee 
samples differ in ability. 

Ability differences between examinee samples may have a serious impact on 
equating results (Cook, Eignor, & Schmitt, 1988). Theoretically, the closer 
the groups are in ability, the more accurate the equating will be. However, 
Marco, Petersen, and Stewart (1983) found that if the anchor test mirrored the 
content and difficulty level of the entire test, sample differences had 
relatively small and unsystematic effects on the quality of equating results. 

Effect of Characteristics of Anchor Items 
The characteristics of anchor items, particularly the content 
representation and number of anchor items, may influence equating 
results. 

Length of Anchor 

Although there is no absolute standard for setting the number of 
anchor items, a rule of thumb is to include at least 20 items or 20% of the 
total number of items in a test, whichever is larger (Angoff, 1984). Several 
studies have shown that as few as five or six carefully selected anchor items 
would perform satisfactorily for IRT equating when the item parameters of 
alternate tests were estimated by the IRT concurrent method (Raju, Edwards, & 
Osberg, 1983; Wingersky & Lord, 1984; Raju, Bode, Larsen, & Steinhaus, 1988; 
Hills, Subhiyah, & Hirsch, 1988). Nevertheless, using the IRT concurrent method, 
Hills, Subhiyah, and Hirsch (1988) found that randomly selected anchor items 
were not sufficient for producing satisfactory equating results; at least ten 
items were needed. 
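
The rule of thumb attributed to Angoff (1984) can be written as a short calculation. The following is a minimal sketch; the function name `min_anchor_items` is hypothetical, and rounding 20% up to a whole item is an assumption of this sketch rather than part of the cited rule.

```python
def min_anchor_items(total_items: int) -> int:
    """Rule of thumb: at least 20 items or 20% of the test, whichever is larger."""
    # Ceiling of total_items / 5, computed in integer arithmetic (assumed rounding).
    twenty_percent = -(-total_items // 5)
    return max(20, twenty_percent)


# Full-length forms described in this paper (197 and 203 items) and a 60-item form.
for n in (197, 203, 60):
    print(n, min_anchor_items(n))
```

For a form of about 60 items the 20-item floor applies, because 20% of the test is only 12 items.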

Content Representation 

Whether anchor items are a representative subset of the entire test, in 
terms of content and statistical properties, is especially important when 
examinee groups vary in ability (Cook & Petersen, 1987). Budescu (1985) 
pointed out that the magnitude of the relationship between the anchor test and 
the unique components of each test form was the single most important 
determinant of the efficiency of equating. The relationship, however, depended on the 
reliability of the total test and the relative length of its two components. 
When non-random groups in an anchor design performed differentially, Budescu 
suggested that it was important to select anchor items that cover the various 
content areas of a particular test to reflect the content mix of the entire 
test. 

Equating Test Scores from Skewed Distributions 
Often, equating is conducted for large-scale achievement tests that have 
approximately symmetrical and bell-shaped score distributions. From time to 
time, though, it is necessary to equate tests that have skewed score 
distributions, such as minimum-competency tests and licensure exams that have 
high passing standards. For licensure or certification programs, test forms 
are often equated with special interest in a particular cut-off score, or a 
range of scores, to inform decision making. To maximize the precision of the 
decision, it is reasonable to pay more attention to improving equating in the 
cut-score region, even at the expense of poorer equating at other scores 
(Brennan & Kolen, 1987). 

Hills, Subhiyah, and Hirsch (1988) equated a minimum-competency test to 
an earlier version administered two years before and found that the results of 
the five equating methods used were generally similar to one another. They 
thus concluded that IRT equating methods could be applied to equating 
minimum-competency tests with extremely skewed distributions. 

Assessing Equating Adequacy 

Equating outcomes can be evaluated in terms of accuracy, sample 
invariance, and scale stability. This section of the review focuses on the 
estimation of equating accuracy, which is more relevant to the study design. 

Criterion for Evaluating Equating Accuracy 

It was found that IRT-based methods were better at equating both 
parallel and non-parallel tests (Kolen, 1981), effective for both inter-level 
and inter-form equating (Green, Yen, & Burket, 1989), and would yield more 
accurate equating outcomes than conventional equating (Petersen, Cook, & 
Stocking, 1983; Hills, Subhiyah, & Hirsch, 1988). These findings, however, 
may be tentative if the criterion used to evaluate the equating accuracy of 
IRT methods was biased. Therefore, in evaluating the effectiveness of various 
equating methods, it is important to seek a relatively unbiased criterion. 

Often, equipercentile equating is used as the evaluation criterion because 
it usually yields satisfactory results. In a comparative study, Livingston, 
Dorans, and Wright (1990) regarded the equipercentile relationship as the true 
equating relationship because true scores could be precisely estimated. Yen 
(1985) also suggested the use of equipercentile equating because it was as 
accurate as IRT equating. 

Indices of Equating Accuracy 

One common index used to represent equating accuracy is the root-mean-squared 
deviation (RMSD), also known as the root-mean-squared error of equating 
(RMSE). Suppose Form-B of a test is equated to Form-A; then 

$$\mathrm{RMSD} = \left\{ \frac{\sum_{y} n_y (x_y - \hat{x}_y)^2}{\sum_{y} n_y} \right\}^{1/2},$$

where (a) $n_y$ is the number of examinees with raw score y on Form-B, (b) $x_y$ 
is the corresponding exact scaled score on Form-A determined by the criterion 
equating, (c) $\hat{x}_y$ is the corresponding exact scaled score on Form-A 
determined by the equating to be evaluated against the criterion, and (d) the 
summation is over the raw-score levels on Form-B (Klein & Jarjoura, 1985; 
Livingston, Dorans, & Wright, 1990). 

Mean equating error, the bias that contributes to RMSD, can also be used 
as an index. It is estimated by $\mathrm{BIAS} = \bar{X} - \bar{X}'$, where $\bar{X}$ is the mean of the 
criterion scores and $\bar{X}'$ is the mean of the equivalents (Klein & Jarjoura, 
1985). In addition, Marco, Petersen, and Stewart (1983) investigated the 
adequacy of curvilinear score equating by using squared bias and standardized 
weighted mean square difference, which gave more weight to values that occurred 
more often, as indices of accuracy. 
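
The two indices described above can be computed directly from a frequency table of Form-B raw scores. The following is a minimal sketch under stated assumptions: the function names and the toy data are hypothetical, the bias is taken as the mean criterion score minus the mean equated score, and both means are weighted by the number of examinees at each raw-score level.

```python
from math import sqrt


def rmsd(counts, criterion, evaluated):
    """Root-mean-squared deviation, weighted by examinee counts per raw score."""
    total = sum(counts)
    squared = sum(n * (x - x_hat) ** 2
                  for n, x, x_hat in zip(counts, criterion, evaluated))
    return sqrt(squared / total)


def bias(counts, criterion, evaluated):
    """Mean criterion score minus mean equated score (sign convention assumed)."""
    total = sum(counts)
    mean_criterion = sum(n * x for n, x in zip(counts, criterion)) / total
    mean_evaluated = sum(n * x for n, x in zip(counts, evaluated)) / total
    return mean_criterion - mean_evaluated


# Hypothetical toy data: three raw-score levels on Form-B.
counts = [1, 2, 1]                 # examinees at each raw score
criterion = [10.0, 20.0, 30.0]     # Form-A equivalents from the criterion equating
evaluated = [11.0, 20.0, 29.0]     # Form-A equivalents from the equating under study
print(rmsd(counts, criterion, evaluated), bias(counts, criterion, evaluated))
```

In this toy case the deviations at the two extreme scores cancel in the bias but not in the RMSD, which illustrates why the two indices are reported together.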

Dimensionality Issues 

The robustness of the IRT model to violations of its assumptions is a 
major concern in IRT equating, because achievement tests usually cover 
multiple content topics, which may influence IRT model fit. 

Definition of Unidimensionality 

Test scores are most meaningful when all the items depend on a single 
trait. If the IRT assumption of unidimensionality holds, local independence 
should be observed. Statistically, local independence requires that, for a 
fixed ability level $\theta$, the item characteristic functions for any pair of items 
i and j be independent (Lord, 1982b). If the responses to the given 
items i and j are not independent at a fixed $\theta$, the 
responses may depend on some trait other than $\theta$. Hence, the IRT 
assumption of unidimensionality is violated. 
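
For a pair of dichotomously scored items, the local independence condition is commonly written as follows. The notation $U_i$ and $U_j$ for the item responses is introduced here for illustration and is not taken from the source.

```latex
P(U_i = u_i,\; U_j = u_j \mid \theta)
  = P(U_i = u_i \mid \theta)\, P(U_j = u_j \mid \theta),
\qquad u_i, u_j \in \{0, 1\}.
```

When the equality fails at some ability level, the two responses share variance that the single ability does not account for.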

Robustness of Unidimensionality Assumption 

It has been shown that violation of unidimensionality might have an 
impact on equating, but the effect might not be substantial (Dorans & 
Kingston, 1985). It depended on how the violation of the assumption was 
formulated. It was found that dimensionality violation would cause asymmetry 
of equating and influence the estimated magnitude of the item discrimination 
parameter. However, similar equating outcomes were also found in equating 
tests differing in their dimensionality. This suggested that IRT equating might 
be robust against violation of the unidimensionality assumption. Alternatively, 
it could be hypothesized that there was an overall ability, which could be 
conceptualized as a weighted composite of separate component abilities (Dorans 
& Kingston, 1985; Reckase, Ackerman, & Carlson, 1988; Yen, 1984). 

Reckase, Ackerman, and Carlson (1988) demonstrated that items 
measuring the same weighted composite of abilities would meet the 
unidimensionality assumption for most IRT models. Dorans (1990) also 
argued that, although tests ought to measure the same construct and have the 
same content mix, they did not have to be composed of unidimensional items. 
If a test involved independent traits that influenced only a few items, Yen 
(1984) suggested that these traits might be ignored when the unidimensional 
trait was defined. 

Limitations of Equating 

Equating cannot solve problems originating in rough or improper test 
construction. It is mainly intended for tests that were reasonably well 
constructed but failed to yield parallel forms. Both conventional equating and 
IRT equating are primarily designed for test forms that have minor differences in 
difficulty. Cook and Eignor (1991) indicated that no equating method could 
satisfactorily equate tests that were markedly different in difficulty, 
reliability, or test content. As a result, there is concern about the 
feasibility of vertical equating, which transforms scores across levels of 
achievement onto a single scale. 

Due to floor and ceiling effects, tests that differ in difficulty are 
not likely to be equally reliable for all sub-groups of examinees (Skaggs & 
Lissitz, 1986). However, equal reliability is usually assumed in equating 
methods such as linear equating and equipercentile equating. Thus it was argued 
that observed scores on tests differing in difficulty cannot be equated. 
In practice, nevertheless, equating is conducted in a loose sense for a 
pragmatic purpose: to approximate an ideal equivalency. 

Description of Data 

The particular test data used in this study have a rich content mixture 
(items were from 23 content sub-areas), which enables this study to 
investigate a variety of equating issues such as the characteristics of anchor 
items. Specifically, scores on the two forms, Book-A and Book-B, of a 1993 
in-training examination taken by the candidates of a medical specialty were 
analyzed. The candidates took the test while participating in various 
in-training programs located at different sites (usually in hospitals) to 
prepare for the board certification examination. No absolute score was used 
to determine pass or fail. The passing standard was 75% of the total test 
items being correctly answered. 

To become board-certified, the candidates were strongly motivated to 
participate in the in-training programs in preparation for the 
certification exams. Since the in-training test provided candidates with a 
valuable opportunity to become familiar with the formal certification exams, it was 
assumed that the candidates had taken the test as seriously as they would take 
the formal exams. 

Test Content and Format 

The test forms were composed of five-alternative multiple-choice items, 
and the content of all the items was emergency-medicine-related. The item 
responses were all scored as right or wrong (coded as 1 or 0). Book-A had 203 
items, of which 58 items were unique to Book-A. There were 52 unique items in 
Book-B, and the total number of items was 197. There were 145 anchor 
items in total, and the anchors were identically embedded in both forms in terms of 
wording and location. 

Examinee Groups 

A total of 2,242 candidates took the in-training test. After screening 
the data, one examinee who had apparently guessed throughout the entire test was 
deleted from the analysis to secure the validity of scoring. Among the 2,241 
remaining subjects, 1,092 took Book-A and the other 1,149 took Book-B. 

The examinee group taking Book-B scored higher on average on the anchor 
items; therefore, it was likely that this group of examinees had higher 
ability. Nonetheless, Lord (1981) noted that a difference in ability level 
would not influence the equating result when an anchor-test design was employed. 
In addition, the group taking Book-B had a lower mean score on the unreduced 
full-length test. This implied that the unique items in Book-B were more 
difficult on average. 

The test forms generally met the equating requirements that were 
mentioned earlier in the review of equating guidelines. Specifically, the 
test was reasonably long and all the items were from a single item pool. 
The anchor items constituted the major part of the total test. Some of the 
items were administered in the previous year under the same standardized 
testing conditions. The combined size of the examinee groups, over 2,200 
subjects, was reasonably large. In addition, the scoring key was clear and the test 
results appeared to be stable, given the preliminary analyses based on 
classical test equating. 



Research Design 

Using four different item sampling schemes, pairs of test forms were 
assembled in this study with items sampled from the same large item pool. The 
various schemes were devised to manipulate the content mix of the tests, or 
the content representation of the anchor items embedded in the test forms. The 
pairs of test forms were then equated, using an anchor-item equating design with 
non-equivalent examinee groups, by the Tucker linear method and two IRT methods. 
The equating results yielded by the different methods were compared against an 
appropriate criterion for evaluating equating accuracy that had several 
desirable properties. 

Overall, content representation of anchor items and equating method are 
the two variables delineating the entire study. Other than the summary 
presented in Tables 1 and 2, basic research designs of this study are further 
elaborated in the following paragraphs. 

Internal Anchor-Item Equating Design 

The two examinee groups taking alternate test forms were not formed by 
random selection or assignment. Therefore, equating was made possible by the 
common items embedded in the alternate test forms. For the original test 
forms, the content of the anchor items was made representative of the entire 
test, and the anchor items were embedded in alternate test forms with the same 
wording and at the same positions. 

Manipulation of Content Representation of Anchor Items 

All the items in the two original test forms are from a single large 
content domain. However, the content domain can be divided into 23 
sub-content areas. Pooling together the items from the two original test forms, 
four subsets of items were drawn to form shorter test forms that had a similar 
number of anchor items but whose anchor items differed in their content 
representation. Thus the effect of content representativeness of anchor items 
on test equating could be studied. In general, the test lengths of all the 
shorter test forms (about 60 items) reflected the common test length seen in 
testing practice, and the various item sampling schemes used in this study 
are also used frequently in test construction. 

Assumptions of Item Sampling Schemes 

The four item sampling schemes used in this study made different assumptions 
about the content of the test. They are briefly summarized in 
this section; details of the item sampling schemes and the sampling results 
are described in Appendix A. 



Table 1 

Summary of Basic Research Designs (1) 

The simple random sampling disregarded the existence of the 23 
sub-content areas and randomly drew items from the large item pool. The 
equal-weight domain random sampling (random sampling stratified on sub-content 
areas) assumed that each of the 23 sub-content areas represented a significant 
part of the medical content domain and that these sub-content areas were equally 
important. The proportional domain random sampling assumed that the size of a 
sub-content area reflected its significance; therefore, it drew from each of 
the 23 sub-content areas a number of items proportional to the size of the 
area. Finally, the purposeful sampling included only the items from the largest 
three content sub-areas, assuming that the number of items in a sub-content 
area reflected the importance of the content. 

If a test form involved a smaller number of sub-content areas, we would 
have more confidence in the assumption of unidimensionality made about the 
content of the test form. 
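
The four item sampling schemes can be sketched as follows. This is an illustration under stated assumptions, not the procedure documented in Appendix A: the function names, the toy item pool, the per-area quota, the sampling fraction, and the floor of one item per area in the proportional scheme are all hypothetical, and the sketch does not model the additional constraint on the number of anchor items.

```python
import random
from collections import defaultdict


def group_by_area(pool):
    """Map each sub-content area to the list of item ids it contains."""
    areas = defaultdict(list)
    for item_id, area in pool:
        areas[area].append(item_id)
    return areas


def simple_random(pool, n, rng):
    """Ignore sub-content areas and draw n items from the whole pool."""
    return rng.sample([item_id for item_id, _ in pool], n)


def equal_weight(pool, per_area, rng):
    """Draw the same number of items from every sub-content area."""
    picked = []
    for items in group_by_area(pool).values():
        picked.extend(rng.sample(items, min(per_area, len(items))))
    return picked


def proportional(pool, fraction, rng):
    """Draw from each area a number of items proportional to its size."""
    picked = []
    for items in group_by_area(pool).values():
        k = max(1, round(fraction * len(items)))  # at least one item (assumed)
        picked.extend(rng.sample(items, k))
    return picked


def purposeful(pool, n_areas=3):
    """Keep only the items that belong to the largest sub-content areas."""
    areas = group_by_area(pool)
    largest = sorted(areas, key=lambda a: len(areas[a]), reverse=True)[:n_areas]
    return [item_id for a in largest for item_id in areas[a]]


# Hypothetical pool: 23 areas, where area k holds k items (276 items in total).
toy_pool = [(f"a{k}_i{j}", k) for k in range(1, 24) for j in range(k)]
rng = random.Random(0)
print(len(simple_random(toy_pool, 60, rng)))
print(len(equal_weight(toy_pool, 2, rng)))
print(len(proportional(toy_pool, 0.2, rng)))
print(len(purposeful(toy_pool)))
```

Only the purposeful scheme is deterministic; the other three depend on the random seed, so repeated draws give different forms with the same content structure.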

Controlling for Anchor-length Effect 

From a previous study using the same data, it was found that equating 
accuracy depended on the number of anchor items in the test forms being 
equated. Specifically, equating results from test forms that had longer 
anchors tended to be more accurate (Yang & Houang, 1996). Therefore, 
in this study, the numbers of anchor items in the various shorter test forms were 
fixed at 30 to avoid the confounding effect resulting from different anchor 
lengths. Thirty anchor items had been found to yield sufficiently 
accurate equating results. 

Due to the limited number of items available for item sampling, it was 
difficult to compose test forms that all had the same number of anchor items. 
Nevertheless, this study ensured that at least 30 anchor items, a sufficient 
number of anchor items, were embedded in all of the shorter test forms. 

Equating Methods 

In addition to Tucker linear equating, two IRT-based methods were also 
used to equate alternate test forms for the study of method effect on equating 
accuracy. Both IRT-based methods are based on the 3PL IRT model to account for 
guessing, because the chance for examinees to guess on some items could not be 
ruled out. One of the IRT-based methods used is the two-stage method, which 
linearly transforms estimated IRT parameters on one test form to the parameter 
scale of another form. The second IRT-based method used is the fixed-b 
method, which sequentially calibrates test items of alternate forms. The two 
methods differ in their parameter-estimation procedures and, hence, in their 
equating procedures. 
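
The linear transformation at the heart of the two-stage method can be sketched as follows. The text does not state how the transformation constants were obtained, so this sketch assumes the mean/sigma approach, which matches the mean and standard deviation of the anchor-item difficulty estimates across forms; the function names and parameter values are hypothetical.

```python
from statistics import mean, pstdev


def mean_sigma_constants(b_anchor_target, b_anchor_source):
    """Slope and intercept that map the source scale onto the target scale."""
    slope = pstdev(b_anchor_target) / pstdev(b_anchor_source)
    intercept = mean(b_anchor_target) - slope * mean(b_anchor_source)
    return slope, intercept


def rescale_item(a, b, c, slope, intercept):
    """Transform 3PL item parameters; the guessing parameter is unchanged."""
    return a / slope, slope * b + intercept, c


def rescale_theta(theta, slope, intercept):
    """Transform an ability estimate onto the target scale."""
    return slope * theta + intercept


# Hypothetical anchor-item difficulty estimates from two separate calibrations.
b_anchor_book_a = [-1.2, -0.4, 0.3, 1.1]
b_anchor_book_b = [-1.0, -0.3, 0.2, 0.9]
slope, intercept = mean_sigma_constants(b_anchor_book_a, b_anchor_book_b)
print(rescale_item(0.35, -0.88, 0.25, slope, intercept))
print(rescale_theta(0.5, slope, intercept))
```

Dividing the discrimination by the slope while multiplying the difficulty and ability by it leaves each item's response probability unchanged, which is what makes the rescaling legitimate.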

Criterion for Evaluating Equating Accuracy 

In this study, items were sampled from one large item pool to form shorter 
test forms. As a result, the complete set of 145 
common items in the large item pool could be regarded as the "anchor universe" 
relative to the anchor items embedded in the shorter test forms. "Pseudo true 
scores", the estimated true scores based on this "anchor universe", could be 
computed and thus used as an eligible criterion for evaluating equating accuracy. 
However, such a criterion was only appropriate when the examinee population and 
the testing occasion were considered fixed. 

The "pseudo true score" was estimated by using the total raw score on 
the 145 anchor items. Although such a raw-score-based criterion was 
susceptible to some drawbacks, including being person-dependent and 
item-dependent, it would not be biased toward overestimating the accuracy of IRT 
equating. Intuitively, the lower bound of equating accuracy could be 
estimated for IRT equating. Therefore, the raw-score-based "pseudo true 
score" was chosen to represent a conservative criterion. 

The accuracy of equating results was expressed by the Pearson 
product-moment correlation coefficient (r). A larger positive Pearson r would 
indicate a more accurate equating result. Specifically, true scores based on 
the various equating results from the shorter test forms were estimated and then 
correlated with the "pseudo true scores" to obtain the indices of equating accuracy, 
the Pearson rs. 
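
The accuracy index can be computed as in the following minimal sketch. The function name and the score vectors are hypothetical and are not data from this study.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length sequences."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical data: raw scores on the 145 anchor items ("pseudo true scores")
# and equated scores for the same examinees from a shortened form.
pseudo_true = [92, 101, 110, 118, 127, 133]
equated = [38.5, 41.0, 47.2, 49.9, 52.3, 57.0]
print(round(pearson_r(pseudo_true, equated), 3))
```

Because r is unchanged by any linear rescaling of either variable, it reflects agreement in the relative standing of examinees rather than agreement in score values.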

Research Tools 

A variety of IRT calibration programs, such as ASCAL, BILOG, and LOGIST, 
were available for item and person estimation. The program chosen for the 
analyses of this study was the PC version of BILOG. One advantage of using BILOG 
is that it yields marginal maximum likelihood (MML) estimates, so the 
number of parameters estimated does not increase with the number of 
examinees. As the number of examinees increased, BILOG was found to yield 
more consistent results than LOGIST (Mislevy & Stocking, 1989; Baker, 1990). 
Yen (1987) also found that BILOG always yielded more precise estimates of 
individual item parameters. For a shorter test with ten items, BILOG outperformed 
LOGIST in estimating item and test characteristic functions, whereas for 
longer tests with 20 to 40 items, the two programs yielded similar estimates. 
Mislevy and Stocking (1989) also found that BILOG would yield more reasonable 
results when the examinee samples were smaller. 

In addition to BILOG, SAS for Unix and Excel spreadsheets were also used 
in this study to facilitate Tucker linear equating and all other data 
management and analyses. 



Research Limitations 

The scope and depth of this study were limited by personal interest and 
ability. Environmental conditions, such as the cost, the availability, and 
the capacity of computer packages for IRT calibration and equipercentile 
equating, also set limits for this study. Although the rich 
context of the data analyzed in this study helped enrich the research 
questions and the study design, the data still set limits for this 
study in the sense that they were secondary data, so any manipulations before and 
during data collection were not possible. For instance, equating using an 
anchor-item design was the only option for this study because the test forms 
were written with embedded anchor items and given to non-equivalent groups. 

Results and Discussion 

Results of classical item analyses, correlation analyses on anchor items 
and non-anchor items, inspection of examinee group differences, IRT parameter 
estimation, equating outcomes yielded by the various equating methods, as well as 
the evaluation of equating accuracy are all presented and discussed in this 
section. Issues concerning the use of the index of equating accuracy, the 
adequacy of the 3PL IRT model, as well as the validity and reliability of anchor 
items are also considered. 

Classical Item Analyses 

Analyses of item difficulties showed that, in general, average item 
difficulties, ranging from 0.688 to 0.759, were quite similar for the four 
pairs of test forms and were considered moderate. The standard deviations of 
item difficulties within the various test forms were also very similar, ranging 
from 0.145 to 0.153. These small standard deviations implied that items 
within the same test form generally did not differ much in their 
difficulties. The distribution plots shown in Figure 1 further indicated that 
item difficulties were evenly spread within test forms for all pairs of test 
forms. Distributions of item-total correlations are presented in Figure 2. 
It was found that item scores generally correlated moderately with total test 
scores for all the test forms. 

In summary, classical item analyses suggested that (a) the alternate 
test forms created in this study did not differ much in item difficulty and 
item-total correlation, thus were good candidates for equating, and (b) the 
four pairs of test forms looked quite similar to one another in terms of 
average difficulty, which provided a fair basis for the study of the effect of 
anchor characteristics on equating accuracy. 
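
The two classical item statistics discussed above can be computed as in the following minimal sketch. It assumes that item difficulty is the proportion of correct responses and that the item-total correlation is the uncorrected Pearson correlation between the 0/1 item score and the total score (the item itself is included in the total); the text does not say whether a corrected version was used. The function names and the response matrix are hypothetical.

```python
from math import sqrt


def item_difficulty(responses, item):
    """Proportion of examinees answering the item correctly."""
    return sum(row[item] for row in responses) / len(responses)


def item_total_correlation(responses, item):
    """Pearson correlation between the 0/1 item score and the total score."""
    scores = [row[item] for row in responses]
    totals = [sum(row) for row in responses]
    ms = sum(scores) / len(scores)
    mt = sum(totals) / len(totals)
    cov = sum((s - ms) * (t - mt) for s, t in zip(scores, totals))
    var_s = sum((s - ms) ** 2 for s in scores)
    var_t = sum((t - mt) ** 2 for t in totals)
    return cov / sqrt(var_s * var_t)


# Hypothetical response matrix: 5 examinees (rows) by 4 items (columns).
responses = [
    [1, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 0],
    [1, 1, 1, 1],
]
for j in range(4):
    print(j, item_difficulty(responses, j),
          round(item_total_correlation(responses, j), 3))
```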

Representation of Anchor Items 

Results of correlation analyses on anchor items and non-anchor items 
(see Appendix B) provided a closer look at the composition of the various test 
forms. Overall, for each of the shorter test forms and the original longer 
test forms, anchor items and non-anchor items correlated with each other 
significantly to a moderate degree. The correlation coefficients ranged from 
.44 to .54 across the various shorter test forms. Both anchor and non-anchor 
items of a test also correlated significantly with the entire test to a 
considerable degree. The correlation between anchor items and the entire test 
ranged from .86 to .97 across the various test forms. As a result, it seemed 
reasonable to use the anchor items to equate the entire alternate forms. 

A shrinking trend was found in the correlation between anchor items and 
the entire test across the various test forms. In summary, (a) for test form Book-A, 
the magnitude of the correlation coefficient decreased from .97 for purposeful 
sampling, to .94 for equal-weight domain random sampling, to .92 for 
proportional-weight domain random sampling, and to .86 for simple random 
sampling; and (b) for Book-B, the pattern of shrinkage remained, and the 
coefficient dropped from .97 to .94 to .93 to .86 accordingly. The shrinkage 
suggested that the content representation of anchor items was likely to vary with 
the item sampling scheme. The purposeful sampling seemed to have yielded anchor 
items that were most representative of the entire test. This made sense because 
all of the items sampled by this scheme were concentrated in merely three 
sub-content areas and were likely to be more similar in content. The simple 
random sampling scheme resulted in anchor items that seemed least 
representative. This finding could be attributed to the fact that the randomly 
sampled items were scattered over all 23 sub-content areas, so that the overall 
content was more heterogeneous. The similar results of equal-weight and 
proportional-weight domain random sampling might reflect the lack of difference 
between sampling items evenly from all sub-content areas and placing more 
emphasis on larger sub-content areas. 

It should be noted, though, that the magnitude of the correlation between 
anchor items and the entire test was inflated by auto-correlation because the 
internal anchor was a subset of the test. The magnitude of auto-correlation 
depended greatly on the number of anchor items embedded in a test. 
Consequently, whether anchor items were representative of the entire test 
should not be determined solely by looking at the correlation coefficient. In 
this study, however, the effect of auto-correlation was expected to be about 
the same on the various test forms because their anchor lengths were fixed to be 
similar. 
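
The inflation can be made explicit with a standard part-whole result. Writing the total score as $T = A + U$, where $A$ is the anchor score and $U$ is the score on the unique items (symbols introduced here for illustration), the anchor-total correlation follows from the variances of the two parts and their correlation $r_{AU}$:

```latex
r_{AT} = \frac{\sigma_A + r_{AU}\,\sigma_U}
              {\sqrt{\sigma_A^2 + \sigma_U^2 + 2\, r_{AU}\,\sigma_A \sigma_U}}
% Illustration, assuming \sigma_A = \sigma_U and r_{AU} = .50:
% r_{AT} = 1.5 / \sqrt{3} \approx .87
```

In this illustrative case a moderate anchor-unique correlation of .50 already gives an anchor-total correlation near .87. The equal-variance assumption is for illustration only and is not a property reported for the test forms.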

Considerations of Group Differences 

Overall, the average raw scores of examinee groups taking different test 
forms did not differ substantially. Upon closer inspection of the raw 
scores, however, it was found that examinees taking one test form (Book-B) 
scored slightly higher than examinees taking the other form (Book-A) on both 
anchor items and unique items across all pairs of shorter test forms. 

To further inspect examinee group differences, the average item 
difficulties broken down by test form and type of item were computed for all 
of the test forms. The average item difficulties are 
summarized in Table 3. Slightly larger percentages were found consistently 
over the various test forms for examinees taking Book-B on anchor items, 
indicating that these examinees might have higher ability than examinees taking 
Book-A. The group differences were probably due to the non-random selection 
or assignment of examinees in testing. 

As discussed previously in the literature review, examinee-group disparity 
may be a threat to the equating accuracy of the Tucker linear method; therefore, 
the Levine equally reliable method is sometimes recommended instead (Kolen & 
Brennan, 1987). In this study, however, the Tucker method was still used because 
(a) the differences found between examinee groups were small and the equating 
results of the Tucker method were not expected to be affected, (b) the advantage 
of the Levine method over the Tucker method is still not clearly known (Kolen & 
Brennan, 1987), (c) the Levine method generally is more appropriate for more 
similar test forms, but the similarity between the test forms used in this 
study was not clearly confirmed, and (d) it was found that the equating results 
yielded by the two methods for the original test forms were almost identical, 
so it was safe to conclude that the two methods would make no difference for 
the test analyzed in this study. 

2. In this study, 1,092 examinees took Book-A, and 1,149 took Book-B. 

Estimation of IRT Parameters 

Results of IRT item and person parameter estimation are summarized in 
Table 4 for the four pairs of test forms. Roughly, the patterns of estimated 
parameters showed that test forms created by different item sampling schemes 
differed less in their average item discrimination but more in their item 
difficulties. The mean item difficulties on anchor items also differed across 
test forms, and the mean item difficulties for the test forms created by 
purposeful sampling of items looked especially different from the rest. The 
differences in the estimated item difficulties seemed to suggest some effect 
of item sampling on test and anchor characteristics. Comparing the mean item 
difficulties on anchor items for the two alternate forms (Book-A and Book-B), 
it should be noted that purposeful sampling seemed to have created test forms 
that were more different than the forms created by the other item sampling 
schemes. 

Equating Ability Estimates 

Equated IRT ability estimates yielded by the two-stage and fixed-b methods 
were correlated to compare the equating results of the two methods. Pearson 
correlation coefficients were computed, and the results for the various test forms 
are as follows: (a) r = 0.99985 for the test form composed by simple random 
sampling of items, (b) r = 0.99961 for the equal-weight domain random sample, (c) 
r = 0.99961 for the proportional-weight domain random sample, and (d) r = 0.99993 for 
the purposeful sample. These nearly perfect and significant correlations strongly 
suggested that the two IRT equating methods were almost identical in 
determining the standings of individual examinees in a group. It could be 
argued that there was no IRT method effect on ability estimation in this 
study. 

Estimation of True Scores 

To obtain true score estimates, the following formula was used (Lord, 
1980): 

$$\text{Estimated true score } (\hat{\xi}) = \sum_{i=1}^{n} P_i(\theta) = \sum_{i=1}^{n} \left\{ c_i + \frac{1 - c_i}{1 + \exp[-1.7\, a_i (\theta - b_i)]} \right\},$$

where $\theta$ is examinee ability and n is the number of items. 
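
The true score estimate is the sum of the 3PL item response probabilities at a given ability. The following is a minimal sketch; the function names and item parameters are hypothetical, and the scaling constant 1.7 is the conventional value assumed here.

```python
from math import exp

D = 1.7  # conventional scaling constant of the logistic 3PL model (assumed)


def p_correct(theta, a, b, c):
    """3PL probability of a correct response to one item."""
    return c + (1.0 - c) / (1.0 + exp(-D * a * (theta - b)))


def true_score(theta, items):
    """Estimated true score: sum of item probabilities over the test."""
    return sum(p_correct(theta, a, b, c) for a, b, c in items)


# Hypothetical item parameters (a, b, c) for a three-item test.
items = [(0.4, -1.0, 0.25), (0.3, 0.0, 0.25), (0.5, 1.0, 0.20)]
for theta in (-2.0, 0.0, 2.0):
    print(theta, round(true_score(theta, items), 3))
```

The estimated true score always lies between the sum of the guessing parameters and the number of items, and it increases with ability whenever the discriminations are positive.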

As expected, for all test forms, the correlation between estimated true 
scores based on the two IRT equating methods was almost perfect and significant. This 
was consistent with the findings on the IRT ability estimates. Thus 
it was concluded that the two IRT equating methods were not different in 
equating the tests in this study and would place individual examinees of a 
group in almost the same order. 

Results of Tucker Linear Equating 

For each pair of alternate test forms, the Tucker linear method was applied to 
find an equating equation for transforming scores on Book-B to a set of new 
scores comparable to scores on Book-A. The Tucker equating equations derived 
for the four shorter tests are presented in Table 5, along with a 
summary of the important statistics used to arrive at the equations. Using 
the Tucker equations, equivalent scores were established for test forms Book-A 
and Book-B. 
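
The Tucker formulas are not written out in this section. The sketch below uses the form of Tucker linear equating commonly given in equating textbooks for an anchor-item design with nonequivalent groups. It is an illustration under stated assumptions, not the computation used in the study: synthetic-population weights are taken proportional to sample size, population (divide-by-n) moments are used, and all function and variable names are hypothetical.

```python
def mean(xs):
    return sum(xs) / len(xs)


def cov(xs, ys):
    """Population covariance; cov(xs, xs) is the population variance."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


def tucker_linear(total_a, anchor_a, total_b, anchor_b):
    """Return (slope, intercept) mapping Book-B total scores onto the Book-A scale.

    Group 1 took Book-A and group 2 took Book-B; both answered the anchor items.
    """
    n1, n2 = len(total_a), len(total_b)
    w1, w2 = n1 / (n1 + n2), n2 / (n1 + n2)  # assumed synthetic-population weights

    # Regression slopes of total score on anchor score within each group.
    g1 = cov(total_a, anchor_a) / cov(anchor_a, anchor_a)
    g2 = cov(total_b, anchor_b) / cov(anchor_b, anchor_b)

    d_mean = mean(anchor_a) - mean(anchor_b)
    d_var = cov(anchor_a, anchor_a) - cov(anchor_b, anchor_b)

    # Means and variances of each form in the synthetic population.
    mu_a = mean(total_a) - w2 * g1 * d_mean
    mu_b = mean(total_b) + w1 * g2 * d_mean
    var_a = cov(total_a, total_a) - w2 * g1**2 * d_var + w1 * w2 * g1**2 * d_mean**2
    var_b = cov(total_b, total_b) + w1 * g2**2 * d_var + w1 * w2 * g2**2 * d_mean**2

    slope = (var_a / var_b) ** 0.5
    intercept = mu_a - slope * mu_b
    return slope, intercept
```

A Book-B score y is then placed on the Book-A scale as slope * y + intercept. When the two groups perform identically on the anchor items, the correction terms vanish and the method reduces to ordinary linear equating of means and standard deviations.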

Evaluation of Equating Accuracy 

The total raw scores of examinees on all 145 common items in the 
original item pool were computed and treated as the "pseudo true scores." The 
"pseudo true scores" were then correlated with the estimated IRT true scores 
yielded by the two IRT equating methods, as well as with the scaled total 
scores obtained by the Tucker linear method. Pearson correlation coefficients 
were computed and used as indices of equating accuracy. Specifically, a larger 
positive coefficient would indicate a more accurate equating result. The 
correlation coefficients between the "pseudo true scores" and the estimated 
true scores yielded by the various equating methods for the various test forms 
are presented in the large correlation matrix in Table 6 to illustrate the accuracy 




Table 4 

Results of IRT Parameter Estimation 

Estimated  Statistic   Simple random   Equal-weight      Proportional-weight   Purposeful
parameter              sampling        domain random     domain random         sampling
                                       sampling          sampling

Book-A
a          mean         0.340           0.342             0.340                 0.444
           s.d.         0.173           0.168             0.127                 0.192
b          mean        -0.884          -1.445            -1.090                -0.653
           s.d.         2.239           2.231             2.043                 1.848
c          mean         0.252           0.260             0.256                 0.247
           s.d.         0.046           0.033             0.029                 0.052
b anchor               -1.340          -1.750            -1.090                -0.750
θ          mean         0.003           0.006             0.005                 0.007
           s.d.         0.851           0.854             0.839                 0.897

Book-B, using IRT two-stage method
a          mean         0.377           0.355             0.410                 0.444
           s.d.         0.165           0.162             0.162                 0.041
b          mean        -1.008          -1.561            -0.380                -0.904
           s.d.         1.891           2.240             2.361                 1.705
c          mean         0.241           0.270             0.328                 0.231
           s.d.         0.034           0.030             0.052                 0.041
b anchor               -1.45           -1.88             -0.840                -0.180
θ          mean         0.003           0.004             0.012                 0.005
           s.d.         0.868           0.857             0.858                 0.886

Book-B, using IRT fixed-b method
a          mean         0.400           0.377             0.433                 0.462
           s.d.         0.164           0.165             0.166                 0.194
b          mean        -0.591          -1.200             0.052                -0.650
           s.d.         1.951           2.374             2.301                 1.774
c          mean         0.311           0.347             0.384                 0.277
           s.d.         0.053           0.048             0.056                 0.049
θ          mean         0.059           0.011             0.142                 0.061
           s.d.         0.880           0.867             0.869                 0.888

Note: a = item discrimination parameter 
b = item difficulty parameter 
c = guessing parameter 
θ = person ability parameter 
b anchor = mean anchor item difficulty 








Table 5 

Summary of the Results of Tucker Linear Equating 

[Table body not recoverable from the scanned source.] 

Table 6 

Correlation Matrix for Evaluating Equating Accuracy 

(Index of accuracy: Pearson r between 'pseudo true score' and true score estimate) 
of the various equating methods. 

Comparisons Among Equating Methods 

Overall, the indices of accuracy (Pearson correlation coefficients) 
ranged from .832 to .894 across the various test forms. The 
equating results yielded by the three equating methods thus appeared to be 
moderately accurate, and examinees were generally ordered in a consistent way, 
no matter which method was used. 
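
The index of accuracy used here is a Pearson correlation between the pseudo true scores and each method's score estimates. The sketch below shows one way such indices could be computed. The function names and the layout of the input (a dictionary of score lists keyed by method name) are hypothetical, and no data from the study are reproduced.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def accuracy_indices(pseudo_true, estimates_by_method):
    """Correlate the pseudo true scores with each method's estimated scores."""
    return {
        method: pearson_r(pseudo_true, estimates)
        for method, estimates in estimates_by_method.items()
    }
```

Because the Pearson coefficient is unchanged by linear rescaling of either variable, this index reflects how consistently a method orders examinees rather than how closely the equated scores match the criterion in absolute terms.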

Although the indices of accuracy in Table 6 all looked 
similar, IRT equating appeared to yield more accurate results than the 
Tucker linear method in all but one case. The exception occurred when the 
test forms composed by proportional-weight domain random sampling were equated, 
where the Tucker method (r=.860) seemed to do better than the IRT two-stage 
method (r=.845). The results of the two IRT methods correlated strongly, with 
rs ranging from .976 to .999 (see the bolded numbers in Table 6), showing that 
the IRT methods yielded very similar results. The results of the Tucker method, 
however, correlated less strongly with the IRT results, with rs ranging from 
.944 to .973 (see the underscored numbers in Table 6). 

Comparisons Among Test Forms 

Comparing the equating results on the various test forms, it was found that 
both the Tucker and IRT methods worked best for the forms composed by the 
purposeful item sampling scheme, where the index of accuracy was .895 on 
average. The methods seemed to yield the least satisfactory results for the 
forms based on simple random sampling of items, where the mean accuracy was 
.847. In addition, the average accuracies for the test forms based on 
proportional-weight and equal-weight random sampling were .858 and .869, 
respectively, indicating a similarity in the item sampling effects of the two 
schemes. 

Effect of Content Representation of Anchor Items 

As discussed earlier, purposeful sampling yielded the most 
representative anchor items, and random sampling resulted in anchor items that 
were least representative of the entire test. Combining these findings with 
the above outcomes, it seemed reasonable to conclude that equating accuracy 
might depend on the content mix or the content representativeness of anchor 
items. That is, the Tucker linear method and the two IRT methods are more 
likely to yield accurate results when anchor items are more similar to the 
entire test, or when the content coverage of a test concentrates on fewer 
topics. In short, the characteristics of anchor items may have a substantial 
impact on the accuracy of test equating, regardless of the equating method 
used. As a result, to improve equating accuracy, it is important to include 
anchor items that fully reflect the overall content coverage of the entire test. 

Controlling Artifact due to Auto-correlation 

For the index of accuracy, there was a concern about auto-correlation 
caused by the fact that the "pseudo true score" was computed from the 
complete set of 145 anchor items and the anchor items in the shorter test forms 
were part of that complete anchor set. Because of the overlapping items, the 
correlation coefficients relating pseudo true scores to 
estimated true scores were inflated. To unmask the relationship and better 
estimate equating accuracy, "pseudo true scores" were correlated with the 
estimated IRT true scores that involved non-anchor items only. The results 
of these correlation analyses are summarized in Appendix C. The same strategy 
for controlling auto-correlation, however, was not applied to Tucker linear 
equating. Because the Tucker method is based on the observed test score as a 
whole, unlike IRT methods, which are more flexible in calibrating revised 
tests, it is not feasible to obtain scaled scores on non-anchor items only. 
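
In terms of the true-score formula, this restriction amounts to summing the 3PL probabilities over the non-anchor items only. The notation below is introduced here for illustration and does not appear in the original analysis: N denotes the set of non-anchor items in a shorter form, and the scaling constant 1.7 is assumed. Whether ability was re-estimated for the restricted item set is not stated in the text.

```latex
\hat{\xi}_{N} = \sum_{i \in N} P_i(\theta)
             = \sum_{i \in N} \left\{ c_i + \frac{1 - c_i}{1 + \exp[-1.7\, a_i (\theta - b_i)]} \right\}
```

Since none of the items in N contribute to the 145-item pseudo true score, any correlation between the two scores can no longer come from shared item responses.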

After controlling for the artifact due to auto-correlation, the patterns of 
rs found among the various methods and test forms in the previous section 
remained unchanged. The problem of auto-correlation did not seem to be 
serious; therefore, the conclusions about the accuracy of the various equating 
methods on different test forms and the effect of anchor characteristics were 
retained. 

Although the threat from auto-correlation may not be completely 
eliminated by removing anchor items from the correlation analyses, by 
controlling part of the artifact, the new set of indices of accuracy 
provides a better opportunity for understanding the effectiveness of the 
equating methods. 

Concurrent Validity and Reliability of Anchor Items 

The data were further exploited to investigate the validity and 
reliability of anchor items by correlating "pseudo true scores" with IRT 
estimated true scores using anchor items only. The results are summarized in 
Appendix D. Because the anchor items included in the shorter test forms were 
part of the set of 145 anchor items from which the "pseudo true score" was 
derived, the "pseudo true score" could also be regarded as a similar but more 
reliable measure for the anchor items. From this perspective, the "pseudo true 
score" was used as a criterion measure to study the concurrent validity of 
anchor items, and the Pearson correlation coefficient was computed as a 
measure of validity. Furthermore, when an observed score (the estimated true 
score) is correlated with its corresponding true score (the "pseudo true 
score"), the correlation coefficient may be regarded as a reliability measure 
for the observed score. From this point of view, the Pearson rs in Appendix D 
were also measures of reliability. 

In summary, a strong relationship was found between the "pseudo true score" 
and the anchor items for each of the shorter tests. This provided some 
evidence of validity and reliability for the anchor items. On average, the 
validity/reliability coefficient was .894 for the anchor items of the test 
form composed by the purposeful item sampling scheme, .875 and .858 for the 
anchor items sampled by the equal-weight and proportional-weight schemes, and 
.856 for the anchor items drawn by simple random sampling. Given the validity 
and reliability evidence for the anchor items used for equating, along with 
the equating accuracy found, both IRT equating methods were concluded to be 
satisfactory. 

Limitation of the Criterion for Evaluating Equating Accuracy 

As described earlier in the research design section, the "pseudo true 
score," the raw-score-based criterion for evaluating equating accuracy, is 
conceptually reasonable and will not overestimate the accuracy of IRT 
equating. The evidence of reliability and validity, as well as the 
availability of data from all examinees, also supports the use of the 
criterion. Nonetheless, it is limited in the following senses: (a) it is only 
appropriate when the examinee group and testing occasion are considered fixed, 
as noted earlier; (b) in essence, it remains a convenient, close estimate of 
the true score that contains measurement error; and (c) it is susceptible to 
problems such as person dependence and item dependence, due to its 
raw-score-based nature. 

Alternatively, an IRT estimated score can be computed using the 145 common 
items and used as another type of "pseudo true score" or criterion for 
evaluating equating accuracy. However, it is known that such an IRT-based 
criterion may be biased, overestimating the accuracy of IRT equating while 
underestimating the accuracy of linear equating. Taking all of these facts 
into account, the raw-score-based criterion was used in this study because it 
would provide a conservative estimate of equating accuracy for IRT equating. 

Adequacy of the 3PL IRT Model 

The results of using the item and person parameter estimates of the 3PL IRT 
model for equating the minimum competency test analyzed in this study seemed 
adequate. As explained earlier, the use of the 3PL IRT model for parameter 
estimation is a logical choice. In addition, the satisfactory equating 
results yielded by the two IRT equating methods also help justify its use. It 
can thus be concluded that it is appropriate to include a guessing parameter 
when tests or test forms with negatively skewed score distributions are 
equated. 



Suggestions 

Equating accuracy can be better estimated if unbiased evaluation 
criteria are identified and used. To compensate for the arbitrary and often 
biased nature of the common criteria used for evaluating equating accuracy, 
multiple criteria can be devised so that the 
estimation outcomes can be compared to determine the relative effectiveness of 
these criteria. Therefore, in a subsequent study, several other criteria are 
proposed to evaluate equating accuracy, including a different "pseudo true 
score" based on the estimated IRT true score using the 145 anchor items, and 
the results of equipercentile equating, which are often considered satisfactory. 

Assuming unidimensionality, the 3PL IRT model seemed to have 
yielded satisfactory estimates in this study, which were used to derive 
equivalent scores in the subsequent equating process. However, because there 
are 23 sub-content areas nested within the large content domain of the test, 
it is unclear whether the assumption of unidimensionality holds. 

If there is in fact more than one underlying trait for the test, then the 
findings of this study suggest that the IRT model used is robust to 
violation of the unidimensionality assumption. Nevertheless, in such a case, 
multidimensional IRT models may yield better results than the unidimensional 
model. Therefore, the dimensionality of the test should be carefully inspected 
or defined via theoretical review, content analysis, or factor analysis so that 
IRT item and person parameters can be better estimated and used in equating. 

For other minimum competency tests, if the guessing effect is not considered 
serious, then the Rasch model or the 2PL IRT model may be a better 
alternative to the 3PL IRT model. More investigation is needed into the 
data-model fit of IRT parameter estimation, since the estimation results may 
have a substantial impact on equating accuracy. 

Beyond the current study, it will be intriguing to investigate how the 
various equating methods function when test forms become longer or the number 
of anchor items increases. Cross-year equating can also be conducted to examine 
the effects of equating over time. If possible, a validation study can also be 
carried out to further determine equating accuracy by correlating equating 
outcomes with the outcomes of other examinations that need no 
equating. 
Appendix A 

Item Sampling Schemes for Shorter Test Forms 

Simple Random Sampling 



Assumption 

Items from different sub-content areas do not differ substantially, since all of them are 
written for one single content domain (medicine-related). 

Method 

Pool and mix items from all of the 23 sub-content areas to form a large item pool. Then, 
randomly sample items from the pool using a random number table. 

Result 

One pair of shorter alternate test forms, each consisting of 60 items. There are 30 anchor 
items in each test form. 



Equal-Weight Domain Random Sampling 

Assumption 

Each of the 23 sub-content areas represents an important part of the medical content 
domain, and the 23 areas are of equal importance. 

Method 

For the first test form, sample three items from each of the 23 sub-content areas, 
regardless of the size of these areas. To spread anchor items evenly across the areas 
and to account for the fact that there are more anchor items in the large item pool, whenever 
possible, two anchor items and one non-anchor item are randomly drawn from each of the areas. 
Use the anchor items sampled for the first test form as the anchor items of the second test form, 
and randomly sample one non-anchor item from each of the content areas to make up the entire 
second test form. 

Result 

A pair of alternate test forms, each consisting of 69 items. For each test form, there are 49 
anchor items and 20 non-anchor items. 



Proportional-Weight Domain Random Sampling 

Assumption 

The size of a sub-content area reflects its importance; that is, the more items a 
sub-content area has, the more important the area is to the medical content domain. 

Method 

From each of the 23 sub-content areas, randomly sample a number of items that is 
proportional to the size of the sub-content area. The sampling procedure is illustrated below in 
more detail: 



Content   Area size           %       % × 60    # of items
area      (total # of items)                    to be sampled

1         13                  5.8     3.48      4
2         23                  10.2    6.12      6
3         3                   1.3     0.78      1
4         14                  6.2     3.72      4
5         5                   2.2     1.32      1
6         19                  8.4     5.04      5
7         5                   2.2     1.32      1
8         3                   1.3     0.78      1
9         6                   2.7     1.62      2
10        9                   4.0     2.40      2
11        8                   3.6     2.16      2
12        4                   1.8     1.08      1
13        13                  5.8     3.48      4
14        7                   3.1     1.86      2
15        5                   2.2     1.32      1
16        15                  6.7     4.02      4
17        13                  5.8     3.48      4
18        25                  11.1    6.66      7
19        8                   3.6     2.16      2
20        5                   2.2     1.32      1
21        4                   1.8     1.08      1
22        9                   4.0     2.40      2
23        9                   4.0     2.40      2

Total     225                 100.0             60
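
The table gives the rounded counts but does not state the rounding rule; note that 3.48 is rounded up to 4 for three areas, which suggests the counts were adjusted so that they total 60. As one possible reading, the sketch below applies a largest-remainder rule. This rule is an assumption rather than something stated in the source, and the function name is hypothetical. With the area sizes listed in the table, it yields the same counts as the table.

```python
def proportional_allocation(sizes, total):
    """Allocate `total` items across areas in proportion to their sizes.

    Uses a largest-remainder rule (assumed): each area first receives the
    integer part of its quota, and the remaining items go to the areas with
    the largest fractional parts.
    """
    pool = sum(sizes)
    counts = [s * total // pool for s in sizes]      # integer part of each quota
    remainders = [s * total % pool for s in sizes]   # fractional part, scaled by pool
    shortfall = total - sum(counts)
    order = sorted(range(len(sizes)), key=lambda i: remainders[i], reverse=True)
    for i in order[:shortfall]:
        counts[i] += 1
    return counts


# Area sizes for the 23 sub-content areas, as listed in the table.
sizes = [13, 23, 3, 14, 5, 19, 5, 3, 6, 9, 8, 4,
         13, 7, 5, 15, 13, 25, 8, 5, 4, 9, 9]
counts = proportional_allocation(sizes, 60)
```

Integer arithmetic is used for the quotas and remainders so that ties and near-ties are not affected by floating-point rounding.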



Result 

A pair of alternate test forms, each consisting of 60 items. In each form, there are 40 anchor 
items. 



Purposeful Sampling 

Assumption 

The more items a sub-content area has, the more important the area is, and the 23 
sub-content areas differ somewhat in their content. In other words, a test form involving a 
smaller number of content areas will be more homogeneous in content. 

Method 

Include all of the items in the three largest content areas, and disregard all items in the 
rest of the areas. 

Result 

For one test form, 45 anchor items and 15 non-anchor items are included. For the 
other test form, there are 45 anchor items and 12 non-anchor items. 





Appendix B 

Correlation Analyses on Anchor and Non-anchor Items 

Appendix C 

Correlation Matrix for Evaluating Equating Accuracy, with a Control of Auto-Correlation 
(Index of accuracy: Pearson r between 'pseudo true score' and true score estimate for non-anchor items only) 

Appendix D 

Correlation Matrix for Reliability and Validity of Anchor Items 
(Pearson r between 'pseudo true score' and true score estimate using anchor items only) 

Note: All of the Pearson correlation coefficients are significant at α = .01. 